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Background properties in experimental particle physics are typically estimated using large col- 
lections of events. However, both the quantum mechanical nature of the underlying physics and 
statistical fluctuations can lead to different events exhibiting appreciably-different features. Al- 
though traditional background estimation techniques based on high-statistics control samples can 
provide a precise description of average background distributions, they are typically unable to de- 
scribe deviations of the shapes of probability distributions in small data sets from the corresponding 
high-statistics templates. From a physics analysis point of view, not taking such deviations into 
account when subtracting background can translate into increased systematic uncertainties and 
into degraded resolution of observables of interest. This article proposes a novel algorithm inspired 
by the Gibbs sampler that builds on a population-based view of particle physics data. Events are 
treated as heterogeneous statistical populations comprising particles originating from different pro- 
cesses such as a hard scattering of interest as opposed to background associated with low-energy 
strong interactions. The algorithm estimates the shapes of signal and background probability 
density functions from a given collection of particles by sampling from a posterior probability 
distribution that encodes information on which particles are more likely to originate from either 
process. Results on Monte Carlo data are presented, and the prospects for the development of tools 
for intensive offline analysis of individual events at the Large Hadron Collider are discussed. 



Keywords: 29.85.Fj; High Energy Physics; 
Particle Physics; Large Hadron Collider; LHC; 
background discrimination; mixture models; la- 
tent variable models; sampling; Gibbs sampler; 
Markov Chain Monte Carlo; Expectation Maxi- 
mization; Multiple Imputation; Data Augmenta- 
tion. 



1 Introduction 

When analyzing data in particle physics, a tradi- 
tional approach consists in estimating probability 
density functions (PDFs) from high-statistics con- 
trol samples. Generally speaking, such templates 
can be used to subtract background from a set of 
candidate events containing signatures of a pro- 



cess of interest. However, in principle, even if the 
physics processes can be exactly the same, both 
the quantum nature of the underlying physics and 
statistical fluctuations can lead to different events 
exhibiting appreciably-different features. When 
physics analysis leads to the identification of a 
large enough number of candidate events, such 
discrepancies with respect to the control sample 
PDFs are often of no practical relevance. How- 
ever, in some cases, most notably with reference 
to searches for new physics, analysis can result in 
only a few interesting events, and fluctuations can 
then lead to the presence of PDF features that 
cannot be described using high-statistics control 
samples. 

This contribution focuses on a novel sampling 



'Email: federico.colecchia@brunel.ac.uk 



I 



2 NOVELTY 



algorithm that aims to improve on control sample 
PDFs by estimating features associated with the 
presence of fluctuations in a data set under inves- 
tigation. In other words, the proposed approach 
makes it possible to estimate features of probabil- 
ity distributions that are associated with fluctua- 
tions in a collection of particles corresponding to 
a set of interesting events. One prospective goal 
of this research is the development of background 
discrimination techniques based on such refined 
PDFs, specifically concentrating on background 
subtraction on an event-by-event basis. Our future 
research in this direction will focus on estimating 
particle-level PDFs from individual events, with a 
view to using event-specific templates to associate 
individual particles inside events with a hard scat- 
tering of interest as opposed to background, e.g. 
that originating from low-energy strong interac- 
tions. In general, in the context of physics analy- 
sis, this is expected to translate into reduced sys- 
tematic uncertainties and into better resolution of 
observables of interest as compared to traditional 
techniques. 

2 Novelty 

The innovativeness of the proposed approach is 
two-fold. 

First of all, as opposed to what is tradition- 
ally done in particle physics, where entire events 
are normally classified into signal or background, 
the focus is here on individual particles inside 
events. A population-based perspective is em- 
ployed whereby events are treated as heteroge- 
neous statistical populations comprising particles 
associated with signal or background processes, i.e. 
events are treated as mixtures of different subpop- 
ulations of particles. A new algorithm inspired by 
mixture model decomposition techniques [T] is pro- 
posed to decompose such a mixture by sampling 
from a posterior probability distribution that en- 
codes information as to which particles are more 
likely to originate from signal as opposed to back- 
ground. In other words, background discrimina- 
tion is here reformulated as a classification prob- 
lem at the particle level, as opposed to focusing 
on entire events: instead of concentrating on the 
probability for a given event to contain a process 
of interest, the focus is here shifted to iteratively 



associating individual particles inside events with 
signal or background. As will be clarified in the 
following, this allows the algorithm to use control 
sample templates as initial conditions to estimate 
the effect of fluctuations on the shapes of particle- 
level PDFs. The implications of this will be elab- 
orated on from a broader perspective in section [HJ 

On top of proposing a new population-based 
view of particle physics events with a focus on in- 
dividual particles, the prospective goal of this work 
is the future development of novel dedicated tools 
for intensive offline analysis of individual events 
at the Large Hadron Collider (LHC), and more 
generally in particle physics. Being able to ob- 
tain high-precision signal and background PDF 
templates from single events will in fact be in- 
strumental in performing background subtraction 
event by event. In other words, it will be possi- 
ble to use event-specific PDF templates to reject 
background- associated particles event by event. In 
terms of physics analysis, since fluctuations will 
then be taken into account during background sub- 
traction, this is expected to result in reduced sys- 
tematic uncertainties and in better resolution of 
observables of interest. This is in turn expected to 
have an impact on the sensitivity in searches for 
new physics, as well as on performance in precision 
measurements. The proposed algorithm is a first 
step in this direction. 

One foreseen application at the LHC relates 
to the identification of candidate particles origi- 
nating from pileup (PU), i.e. from proton-proton 
collisions other than the one an event of interest 
has been triggered on. Residual contamination is 
in fact normally still present following selections 
that are typically applied to reject PU for the pur- 
pose of physics analysis, and the proposed tech- 
nique is anticipated to reduce this residual con- 
tamination by identifying which particles are more 
likely to originate from the hard scattering of inter- 
est as opposed to PU. This is expected to become 
more and more relevant as the LHC instantaneous 
luminosity is increased. We anticipate that this 
method, when used as a preprocessing step before 
jet reconstruction, will improve the resolution of 
jet observables. 

Our first investigation of the applicability of 
an algorithm inspired by the Gibbs sampler [2] 
to background discrimination in particle physics is 
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documented in [3]. However, it should be empha- 
sized that the proposed algorithm is not a proper 
Gibbs sampler, and that it more generally builds 
on different techniques such as Expectation Max- 
imization [3], Multiple Imputation |5], and Data 
Augmentation [6]. The algorithm ultimately re- 
sults from tailoring existing numerical techniques 
to the problem of background discrimination in 
particle physics, borrowing ideas from complemen- 
tary efforts in different areas of statistics research 
ranging from mixture decomposition to latent vari- 
able models. 

3 The algorithm 

The proposed approach to background discrim- 
ination is presented with reference to the gen- 
eral problem of decomposing a collection of par- 
ticles from high-energy particle collisions into sub- 
populations of particles associated with different 
physics processes and described in terms of differ- 
ent PDFs. 

The input data set consists of a mixture of 
particles, some of which originated from a hard 
scattering of interest, others from background as- 
sociated with low-energy strong interactions. Pro- 
vided that the corresponding subpopulations can 
be characterized sufficiently well in terms of their 
kinematic properties, it is in principle possible to 
ask, for each particle, what the probability is for it 
to originate from signal as opposed to background. 
The algorithm samples iteratively from a posterior 
probability distribution that contains information 
on which particles are more likely to originate from 
one process as opposed to the other. 

As compared to classical mixture models, 
which typically rely on a parametric formulation 
that requires the shapes of the subpopulation 
PDFs to be known a priori, our statistical model 
is based on a more general mixture of the form: 

K 

£«,■/*(*) (!) 

where subpopulation fractions ay ("mixture 
weights") are required to sum to unity, i.e. 
Y2j=i a j = 1- 111 the present application, x cor- 
responds to particle pseudorapidity rj, a kinematic 
variable related to particle polar angle 9 in the lab- 



oratory frame by the expression r\ = — ln(tan#/2), 
or to pt i.e. particle transverse momentum with 
respect to the beam direction. 

The PDFs fj are here defined in terms of reg- 
ularized histograms of x, namely based on cubic 
spline interpolation of histogram bin contents. Ad- 
ditional PDF boundary conditions were used in [3] 
in order to obtain a well-defined target distribu- 
tion for the associated Markov Chain. However, 
such constraints are not necessary with this im- 
plementation: particles are here mapped to signal 
or background based on control sample templates 
and not using iteratively-updated PDF estimates 
as was done in [3], where a version of the algo- 
rithm closer to the traditional Gibbs sampler was 
used. This aspect will be mentioned again in the 
following when the pseudocode of the algorithm is 
discussed. The symbol (fj will be used to denote 
an estimate of PDF fj corresponding to a splined 
histogram of x at a given iteration of the sampler. 

Throughout the paper, there will be a clear 
distinction between fj and ipj. Initial estimates 

ip^P of the subpopulation PDFs fj obtained from 
control samples are used in this implementation to 
associate individual particles with signal or back- 
ground on a probabilistic basis at each iteration 
of the algorithm ("mapping"). On the other hand, 
(fj is defined as a splined histogram of x at each 
iteration corresponding to a given mapping, and 
the distribution of x in the data set analyzed is ul- 
timately estimated in terms of splined histograms 
averaged over iterations. As opposed to the control 
sample templates, such PDF estimates describe 
features of the probability distributions that are 
associated with fluctuations in the data set ana- 
lyzed, as will be illustrated in section |4j with refer- 
ence to a Monte Carlo study. 

The choice of the statistical model (fT]) as op- 
posed to traditional parametric formulations was 
driven by our previous studies, where assuming 
a predefined PDF functional form led to signifi- 
cant bias on the mixture weights. The latter ap- 
proach in fact enforced a given functional form on 
the PDF estimates (fj, and the observed bias ulti- 
mately related to assuming that shapes estimated 
on high-statistics control samples were also appro- 
priate to describe the corresponding probability 
distributions in lower-statistics data sets. How- 
ever, as previously noticed, the latter data sets 



3 



4 MONTE CARLO STUDY 



sometimes deviate appreciably from the control 
samples because of the presence of fluctuations, 
and it is necessary for the statistical model to pro- 
vide more flexibility if the effect of fluctuations on 
the PDF shapes is to be described. While a rig- 
orous treatment may call for the use of nonpara- 
metric Bayesian methods f7j, which can provide 
an additional dimension of flexibility to statisti- 
cal models, it was decided to opt for a simplified 
intuition-driven approach for this study, in order 
to avoid the introduction of additional complica- 
tions at this stage. 

In general terms, given the mixture of prob- 
ability distributions ([1]) and a set of N observa- 
tions {xj}i = x r .. jv, the problem of clustering the 
latter into K groups by probabilistically associat- 
ing each of them with a distribution of origin has 
been solved numerically in a Bayesian framework 
using Markov Chain Monte Carlo (MCMC) tech- 
niques. The Gibbs sampler, which as anticipated 
directly inspired this work, belongs to this family 
of numerical methods. 

The basic pseudocode of the proposed algo- 
rithm is reported below. The value of variable v 
at iteration t is denoted t)W throughout. 

1. Initialization: Set = {or- }j, j = 

1, ...K, and obtain estimates tp^ of the sub- 
population PDFs fj using control sample dis- 
tributions as described in section [U 

2. Iteration t: 

(a) Generate the "allocation variables" 
zf), i = 1,...,N, j = 1,...,K 
based on the conditional probabili- 
ties P(4f = l|a?-%fU) = 

a^M^/f^VW The 
i=i 

quantity zf^ equals 1 when observation 
i is mapped to distribution j at itera- 
tion t, and otherwise. 

(b) Generate oft' from the probability den- 
sity function of a given zft^ = 

{ z ij~^}ij} p{(l\zft~ XS> ) ■ Knowledge of 
which particles are mapped to process 
j at iteration t — 1 makes it possible to 
generate the subpopulation fractions a 



at iteration t. A specific choice for the 
function p is described in section |4j 

One central idea of the algorithm is the follow- 
ing: the better the observations {xj}, are mapped 
to the subpopulations j = 1, K, the more accu- 
rate the estimates of p and of the subpopulation 
PDFs. Once some correct values of found, 
p and (fij begin to roughly reflect the correct distri- 
butions, which in turn leads to additional correct 
mappings Zij to be found at subsequent iterations. 

As opposed to the traditional Gibbs sampler 
for mixture models, where the stationary distribu- 
tion of the corresponding Markov Chain relates to 
the joint posterior of the allocation variables and 
of the subpopulation PDF parameters, this algo- 
rithm only samples from the posterior probabil- 
ity of the allocation variables, PDFs being defined 
nonparametrically and kept fixed at control sam- 
ple estimates. 

4 Monte Carlo study 

The algorithm was run on two different Monte 
Carlo data sets generated using Pythia 8.140 |8j 
[9]. The data sets were obtained superimpos- 
ing gg — > ti signal events from pp interactions 
at -y/s = 14 TeV with different numbers of low- 
energy strong interactions, so called Minimum 
Bias events, in order to simulate background. The 
signal process was chosen in order to illustrate the 
use of the algorithm for background discrimination 
at the particle level. The algorithm was also val- 
idated on toy Monte Carlo data sets as described 
in the appendix. Further studies will be needed in 
order to assess the potential of population-based 
techniques for physics analysis at the LHC. 

The pseudocode of this implementation of the 
algorithm is shown below. Subscripts sig and bkg 
relate to signal and background, respectively. 

1. Initialization: Set a bkg = a^ g = a sig = 
a fig = °- 5 ' wnere a bk g = ol\ and a sig = 
ct2 = 1 — Obkg- The initial estimates (fj 
of the subpopulation PDFs fj are given by 
splined one-dimensional histograms of r] and 
Pt obtained from the high-statistics control 
samples. 

2. Iteration t: 
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Figure 1: Control sample 1: Generator- level r\ (a) and px distributions (b) for signal and background 
particles with 2 GeV/c < pj- < 5 GeV/c. The distributions correspond to a total of ~ 33,000 par- 
ticles and are normalized to unit area. Superimposed curves result from spline interpolation, solid 
blue and dotted red curves relating to signal and background, respectively, (c) The corresponding 
two-dimensional (rj,pr) distribution. 




(a) (b) (c) 

Figure 2: Control sample 2: Generator-level r\ (a) and px distributions (b) for signal and background 
particles with 2 GeV/c < pt < 5 GeV/c. Distributions correspond to a total of ~ 48,000 particles and 
are normalized to unit area. Superimposed curves result from spline interpolation, solid blue and dot- 
ted red curves relating to signal and background, respectively, (c) The corresponding two-dimensional 
(ViPt) distribution. 
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(a) Generate z\- for all particles (i = 
1,...,N) and distributions (j = 
1,2 corresponding to background 
and signal, respectively) according 
to P(4f = llaf'V^f^) = 

1=1 

Both the nonparametric treatment of 
the PDFs and the use of ipj instead 

of to map particles to signal 

or background are deviations from the 
classical use of the Gibbs sampler with 
reference to mixture models. 

(b) Set af = Y.A^/N, j = 1,2. 
This corresponds to the simplest choice 
of setting p{aj\z^^ 1 ^) = 5(ctj — 

J2i z ij ^ /N) for the probability den- 
sity function of a given z. 

In general, the functions fj are the joint PDFs 
of 7] and pt corresponding to background (j = 1) 
and signal particles (j = 2). Since this study is re- 
stricted to charged particles with 2 GeV/c < px < 
5 GeV/c, correlations between r\ and pt can be 
neglected as a first approximation. For this rea- 
son, the joint PDFs take the form f s ig/bkg(ViPT) = 

fsig /bkg^) 1 fsig/\kg(P T ) ' 

The results presented in this article were ob- 
tained running the algorithm on two Monte Carlo 
data sets corresponding to different fractions of 
background particles: 

• Data set 1 comprises 1635 charged parti- 
cles in the kinematic region 2 GeV/c < p? < 
5 GeV/c, out of which 1269 originate from 
signal gg — > ti and 366 from Minimum Bias. 
The data set was obtained by generating 50 
gg —7- ti interactions, superimposing 7 Mini- 
mum Bias events on each signal interaction, 
and considering charged particles in the se- 
lected kinematic range. 

• Data set 2 corresponds to the same signal 
and background processes as in data set 1, 
but it contains a lower number of particles 
and a higher background fraction. Specifi- 
cally, 30 signal gg — > ti events were simu- 
lated and superimposed with 20 Minimum 



Bias interactions per signal event. The data 
set comprises 1314 particles, out of which 
768 originate from signal and 546 from back- 
ground. As in the previous case, charged 
particles in the kinematic region 2 GeV/c < 
Pt < 5 GeV/c were used in the analysis. 

In order to obtain the initial estimates yr- 
of the subpopulation PDFs fj, two high-statistics 
Monte Carlo data sets were used: 

• Control sample 1 contains a total of ~ 
33, 000 particles, and was obtained by gen- 
erating 1,000 gg —> ti events and superim- 
posing 7 Minimum Bias events per signal 
interaction. Charged particles were consid- 
ered in the kinematic range 2 GeV/c < pt < 
5 GeV/c. Figure [1] shows the corresponding 
rj (a) and px distributions (b) for signal and 
background particles. Superimposed curves 
result from spline interpolation, solid blue 
and dotted red curves relating to signal and 
background, respectively. Figure [2(c) shows 
the corresponding two-dimensional distribu- 
tion on the (t],pt) plane. This control sam- 
ple was used in conjuction with data set 1. 

• Control sample 2 contains a total of ~ 
48, 000 particles, and was obtained by gen- 
erating 1,000 gg —> ti events and superim- 
posing 20 Minimum Bias events per signal 
interaction. Charged particles were consid- 
ered in the kinematic range 2 GeV/c < px < 
5 GeV/c. Figure [2] shows the correspond- 
ing r] (a) and px distributions (b) for sig- 
nal and background particles. Superimposed 
curves result from spline interpolation, with 
the same color coding as above. Figure [2] (c) 
shows the corresponding distribution on the 
(j],pt) plane. This control sample was used 
in conjuction with data set 2. 

As for the number of iterations to be used with 
the algorithm, no rule is documented in the statis- 
tics literature with reference to related techniques, 
and the choice is generally problem-dependent. 
The number of iterations was set to 1,000 in this 
study, and probabilities were averaged over the last 
100. Runs were also performed letting the sam- 
pler run for a longer time: the algorithm exhib- 
ited a relatively-fast convergence on the data sets 
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analyzed, and no gain was found by choosing a 
higher number of iterations. Moreover, multiple 
runs were performed using different initial condi- 
tions for the mixture weights, in order to make 
sure that the algorithm was still able to converge 
to the correct PDFs. 

The sampler essentially uncovers a signal and a 
background subpopulation in the input collection 
of particles, based on the data as well as on initial 
conditions on the subpopulation PDFs provided by 
control samples. 

The signal and background r] and px distri- 
butions corresponding to data set 1 are displayed 
in figure [3l Points show signal and background 
r\ and pt distributions normalized to unit area. 
Solid blue (signal) and dotted red curves (back- 
ground) resulting from spline interpolation of the 
corresponding control sample distributions are su- 
perimposed with a view to highlighting the effect 
of fluctuations in the data. In particular, as com- 
pared to the corresponding control sample distri- 
bution, the signal rj PDF is not symmetric with 
respect to zero. 

The corresponding distributions from data set 
2 are shown in figure with the same color cod- 
ing as above. The plots again highlight deviations 
with respect to the control sample templates, par- 
ticularly with reference to the signal r\ PDF. 

5 Results 

Figure [5] shows the mixture weights estimated on 
data set 1 as a function of iteration number. Ini- 
tial conditions are shown by the horizontal dotted 
line, and correspond to ol\ = a% = 0.5 i.e. to no 
prior information about the fraction of background 
particles in the sample. Solid blue and dotted red 
curves correspond to signal and background, re- 
spectively, and horizontal solid lines indicate true 
values. As it can be noticed, convergence is es- 
tablished quickly, i.e. the burn-in phase is much 
shorter than typically reported in the MCMC lit- 
erature. We interpret this as a consequence of the 
main objective of the algorithm, namely to refine 
PDF estimates obtained from control samples: in 
other words, the sampler already starts from rea- 
sonable knowledge of the target PDFs, and the 
equilibrium distribution of the Markov Chain, al- 
though corresponding to an improved description 



of the PDF shapes that takes fluctuations into ac- 
count, is normally close to the initial conditions. 

Estimated signal (background) PDFs on data 
set 1, obtained by splining r] and pt distributions 
of candidate signal (background) particles aver- 
aged over the last 100 iterations of the algorithm, 
are displayed as curves in figure where points 
again correspond to the true distributions. A com- 
parison of these plots with those given in figure [3] 
shows that the deviation of the true signal r\ PDF 
in data set 1 from the corresponding control sam- 
ple distribution is correctly described by the sam- 
pler. More generally, the algorithm leads to an 
overall improvement in terms of the description of 
the PDF shapes as compared to the control sample 
templates. 

Ratios between control sample and true PDFs 
in data set 1 are given in figure [7) The corre- 
sponding ratios between estimated and true PDFs 
are shown in figure [HJ 

The plots obtained on data set 2 are given in 
figures PTOl through PT3l and the conclusions are sim- 
ilar. 

With reference to the mixture weights shown 
in figures [5] and [10] corresponding to data sets 1 
and 2, respectively, it is worth noticing that those 
estimates can in general exhibit a bias depending 
on the signal and background PDF shapes in the 
control sample and in the input data set. This is 
due to the algorithm repeatedly mapping particles 
to signal or background based on the control sam- 
ple PDFs, which, as previously mentioned, can be 
notably different from the true distributions. How- 
ever, the purpose of the algorithm presented here 
is to estimate the shapes of signal and background 
PDFs, not the mixture weights, and the estimated 
PDF shapes have been observed to be remarkably 
unaffected by a possible bias on the latter. For il- 
lustrative purposes, figure [9] shows the signal and 
background PDFs estimated on data set 1 with 
mixture weights kept fixed at values correspond- 
ing to an 80% relative deviation of the fraction 
of background particles from its true value. As it 
can be noticed, the PDF shapes are still correctly 
described. 
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Figure 3: Data set 1: Particle r] and px distributions (points), (a) Background r/. (b) Signal rj. (c) 
Background px- (d) Signal px- Distributions are normalized to unit area. Curves resulting from 
spline interpolation of the corresponding control sample distributions are superimposed with a view 
to highlighting deviations due to fluctuations in the data. Solid blue and dotted red curves relate to 
signal and background, respectively. 
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Figure 4: Data set 2: Particle r] and px distributions (points), (a) Background r/. (b) Signal rj. (c) 
Background px- (d) Signal px- Distributions are normalized to unit area. Superimposed curves result 
from spline interpolation of the corresponding control sample distributions, solid blue and dotted red 
curves relating to signal and background, respectively. 
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Figure 5: Data set 1: Mixture weights estimated by the sampler, shown as a function of iteration 
number. Solid blue and dotted red curves correspond to signal and background, respectively, and 
horizontal solid lines indicate true values. Initial conditions are shown as a horizontal dotted line, and 
correspond to ot\ = «2 = 0.5 i.e. to no prior information about the fraction of background particles. 
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Figure 6: Data set 1: Signal and background 77 and pr PDFs (curves) estimated by the algorithm, (a) 
Background r/. (b) Signal r/. (c) Background px- (d) Signal px- Points correspond to Monte Carlo 
truth, and superimposed curves to splined average histograms of ij and px corresponding to candidate 
signal and background particles, as described in the text. Solid blue and dotted red curves correspond 
to signal and background, respectively. 
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Figure 7: Data set 1: Ratios between control sample PDFs and true distributions, (a) Background r]. 
(b) Signal r]. (c) Background p?- (d) Signal p?- 



6 A broader perspective 

It is worth mentioning that, although this ap- 
proach is based on the formalism of mixture mod- 
els, similar results could have been obtained from 
a different perspective. In fact, the Gibbs sam- 
pler is strictly related to a limit case of Data 
Augmentation [6] where the number of imputa- 
tions is set to 1. In other words, a direct con- 
nection can be established between the proposed 
population-based perspective and latent variable 
models where complete-data methods are used re- 
peatedly in order to solve incomplete-data prob- 
lems. The reason for this similarity is, the asso- 
ciation between individual particles produced in a 
high-energy collision and the process they origi- 
nated from, corresponding to the allocation vari- 



ables 



is not accessible experimentally: the 



problem of using observables to estimate that as- 
sociation can be approached both by studying a 
given collection of particles in terms of a mixture 
of different subpopulations, and by treating Zij as 
latent variables. This article discusses a feasibility 



study of the former approach. Further insight may 
be gained from an investigation of both perspec- 
tives. 

Apart from the novelty of a population-based 
view of particle physics events, this contribution 
has also introduced three elements that are not 
present in the Gibbs sampler. First of all, subpop- 
ulation PDFs have been defined nonparametrically 
in terms of regularized histograms, as opposed 
to traditional parametric methods. This provides 
the statistical model with enough flexibility for it 
to describe the effect of fluctuations on the PDF 
shapes in the data: our previous studies have in 
fact shown that the use of a parametric approach 
enforcing given PDF functional forms ultimately 
results in significant bias on the mixture weights, 
corresponding to the equilibrium distribution of 
the Markov Chain not reflecting the structure of 
the posterior. Secondly, this algorithm differs from 
the Gibbs sampler in that the latter samples from 
posterior distributions by alternatively using two 
conditionals: on the other hand, this algorithm 
only uses the conditional distribution of the allo- 
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Figure 8: Data set 1: Ratios between PDF estimates obtained by the sampler and true distributions, 
(a) Background rj. (b) Signal rj. (c) Background px- (d) Signal pr- The plots show an overall 
improvement with respect to figure [71 
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6 A BROADER PERSPECTIVE 
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Figure 9: Data set 1: Signal and background rj and pr PDFs (curves) estimated by the algorithm 
with mixture weights kept fixed at biased values, namely ot\ = 0.4 and ai = 0.6, corresponding to 
an 80% relative deviation of the fraction of background particles from its true value, (a) Background 
7]. (b) Signal r/. (c) Background pp. (d) Signal pp. Points correspond to Monte Carlo truth, and 
superimposed curves to splined average histograms of r\ and pj- corresponding to candidate signal and 
background particles, as described in the text. Solid blue and dotted red curves correspond to signal 
and background, respectively. 
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Figure 10: Data set 2: Mixture weights estimated by the sampler, shown as a function of iteration 
number. Solid blue and dotted red curves correspond to signal and background, respectively, and 
horizontal solid lines indicate true values. Initial conditions are shown as a horizontal dotted line, and 
correspond to a\ = «2 = 0.5 i.e. to no prior information about the fraction of background particles. 
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Figure 11: Data set 2: Signal and background r/ and pp PDFs (curves) estimated by the sampler, (a) 
Background r]. (b) Signal r/. (c) Background pr- (d) Signal pp. Points correspond to Monte Carlo 
truth, and superimposed curves to splined average histograms of ij and px corresponding to candidate 
signal and background particles, as described in the text. Solid blue and dotted red curves correspond 
to signal and background, respectively. 
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Figure 12: Data set 2: Ratios between control sample PDFs and true distributions, (a) Background 
T]. (b) Signal r/. (c) Background px- (d) Signal px- 



cation variables, 2«, given the data, and the PDF 
shapes are estimated using averaged histograms. 
Finally, the idea of sampling using fixed PDFs as 
opposed to updating them during the execution of 
the algorithm is not traditional. 

7 Conclusions and outlook 

We have described a novel algorithm inspired by 
the Gibbs sampler that estimates the shapes of 
particle-level signal and background PDFs from 
a given collection of particles. This is done tak- 
ing into account the effect of fluctuations that can 
cause the PDF shapes to deviate notably from the 
control sample templates. The performance of the 
algorithm has been illustrated on two Monte Carlo 
data sets corresponding to tt production at the 
LHC, and has been cross-checked on toy Monte 
Carlo samples. Although this study has concen- 
trated on decomposing a given collection of parti- 
cles into one signal and one background subpop- 
ulation, the proposed framework can in principle 
be extended to combinations of multiple signal and 



background processes. 

It is worth mentioning that this article does 
not present the algorithm as a classifier, i.e. parti- 
cles are not explicitly classified into signal or back- 
ground at this stage. The sampler is rather pro- 
posed here as a tool to estimate the shapes of sig- 
nal and background PDFs from the data without 
relying exclusively on control samples. The PDF 
shapes are estimated from the data by averag- 
ing the distributions of candidate signal and back- 
ground particles corresponding to a probabilistic 
mapping, which is conceptually similar to the av- 
eraging step in Data Augmentation that aims to 
reduce the uncertainty associated with the pres- 
ence of latent variables. 

As anticipated, this algorithm is proposed as 
a first step towards the development of dedicated 
techniques for intensive offline analysis of individ- 
ual events at the LHC, and more generally in par- 
ticle physics. Background subtraction at the parti- 
cle level on individual events is in fact expected to 
contribute to increased sensitivity in searches for 
new physics, thanks to the improved resolution of 
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Figure 13: Data set 2: Ratios between PDF estimates obtained by the sampler and true distribu- 
tions, (a) Background r/. (b) Signal rj. (c) Background pp. (d) Signal pt- The plots show an overall 
improvement with respect to figure [ 
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9 APPENDIX: TOY MONTE CARLO STUDIES 



observables of interest. Since the numbers of par- 
ticles in the two Monte Carlo data sets chosen for 
this study are roughly in line with typical charged 
particle multiplicities at the LHC at the time this 
analysis was performed, the results presented here 
are an encouraging starting point for further devel- 
opment in this direction. More generally, studies 
are underway in order to extend the results be- 
yond those presented in this article, with a view 
to applying the sampler to analysis of individual 
events at the LHC. 

As noted in [3], efforts to eliminate noise in 
event-by-event analysis of high-energy multiparti- 
cle production are reported in the literature, most 
notably with reference to the study of dynami- 
cal fluctuations in heavy- ion collisions, where the 
notion of "event-by-event fluctuations" was intro- 
duced |10] . e.g. for mean transverse momentum 
and for mean transverse energy measurement. In 
the context of such studies, the focus is e.g. on 
obtaining analytical expressions for the moments 
of probability distributions: those expressions can 
then be used to eliminate statistical fluctuations 
from the data with a view to extracting infor- 
mation about the underlying dynamics [11] , Al- 
though those studies are conceptually related to 
the prospective goal of the approach presented in 
this article in that they aim to subtract noise from 
individual events, they are fundamentally differ- 
ent. First of all, one of the novel aspects of this 
work is the idea of concentrating on individual par- 
ticles inside events, reformulating background dis- 
crimination in terms of a classification problem at 
the particle level: the emphasis of this research on 
a population-based view of particle physics events 
distinguishes the proposed approach from previous 
efforts. On top of this, fluctuations are assumed 
to be Poissonian in while this method does 
not make such a requirement and is much more 
general. 

For the sake of completeness, it should also 
be noted that the proposed population-based per- 
spective can lead to conceptual issues with refer- 
ence to specific processes. For instance, when color 
connection plays a significant role, such as in Un- 
derlying Event, i.e. in interactions between pro- 
ton remnants and between other partons in the 
protons, it may not always be possible to unam- 
biguously associate individual particles with either 



signal or background. While we leave a more de- 
tailed investigation of this conceptual issue for fu- 
ture research, this approach is here proposed as a 
way of decomposing a given collection of particles 
into effective subpopulations thereby taking fluc- 
tuations into account. It is our opinion that this 
will cover many situations of practical relevance in 
physics analysis. 

Finally, it should be mentioned that the iter- 
ative nature of the algorithm can lead to a dis- 
advantage with respect to established techniques 
in terms of execution time. However, the running 
time of the sampler corresponding to 1,000 iter- 
ations on the Monte Carlo data sets used in this 
study was ~ 20 s on a 2 GHz Intel Processor with 
1 GB RAM, so still reasonable for offline use. In 
any case, given the parallelization potential of the 
sampler, which is related to a similar property of 
the Gibbs sampler [2], improvements may be pos- 
sible in this respect, for example using commod- 
ity Graphics Processing Units (GPUs) that have 
been used extensively both in particle physics and 
in other disciplines for compute-intensive applica- 
tions. 
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9 Appendix: Toy Monte Carlo 
studies 

The results from the Monte Carlo study described 
in section 2] were cross-checked on toy Monte Carlo 
data. Samples of ~ 600 signal and background 
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particles were generated according to r/ and px dis- 
tributions similar to those obtained using Pythia. 
Particle 7] and pr were generated independently: 
Gaussian PDFs centered at zero with standard de- 
viations comparable to those observed in Monte 
Carlo were used for 77, and p? values were gen- 
erated based on polynomial PDFs in the range 
2 GeV/c < pt < 5 GeV/c parametrizing the cor- 
responding Monte Carlo distributions. 

Additional cross-checks were performed by 
varying the toy Monte Carlo generation param- 
eters with respect to the nominal values. The al- 
gorithm was also run on different numbers of par- 
ticles in the input data set, with no appreciable 
changes to the results. 
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